KVM: Fix agents dont reconnect post maintenance #3239

Merged
DaanHoogland merged 6 commits into apache:4.11 from shapeblue:agentsdontreconnectpostmaintenance
May 23, 2019

Conversation

@nvazquez
Contributor

Description

Before this fix, there were two possible scenarios when cancelling maintenance/prepare for maintenance on a KVM host:

  • If global setting 'kvm.ssh.to.agent' = true, then the management server connected to the host over SSH and restarted the CloudStack agent service.
  • If global setting 'kvm.ssh.to.agent' = false, then the CloudStack agent service on the host had to be restarted manually for the host to become operational again. The restart was needed to establish a new connection between the management server and the host agent, because the previous connection was closed once the host was notified that it was being put into maintenance.

After cancelling maintenance on one-time SSH password hosts, hosts did not reconnect and were not operational unless a manual restart on the CloudStack agent service was performed.

This feature keeps the connection between the management server and the host agent alive while the host is preparing for maintenance and while it is in maintenance. This implies that:

  • Host agent is connected during maintenance period unless it is stopped
  • If the host or the agent is restarted during the maintenance period, a new connection will be established between the agent and the management server.
  • If a host agent is connected to the management server when cancelling maintenance, then the current connection is kept alive regardless of the value of the global setting 'kvm.ssh.to.agent'
  • If a host agent is disconnected when cancelling maintenance:
    • If 'kvm.ssh.to.agent' = true, then the management server restarts the agent service via SSH into the host
    • If 'kvm.ssh.to.agent' = false, then an error is thrown indicating that the agent must be connected to the management server.
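
The cancel-maintenance behaviour listed above can be read as a small decision table. The following is only a minimal sketch of that table, not the code in this PR: the class name `CancelMaintenanceDecision`, the `Action` enum and its values are hypothetical names chosen for illustration.

```java
/** Hypothetical sketch of the decision described above; not the actual CloudStack code. */
final class CancelMaintenanceDecision {

    /** Illustrative action names (assumptions, not CloudStack identifiers). */
    enum Action { KEEP_CONNECTION, RESTART_AGENT_VIA_SSH, FAIL_AGENT_NOT_CONNECTED }

    /**
     * @param agentConnected whether the host agent is currently connected to the management server
     * @param sshToAgent     value of the global setting 'kvm.ssh.to.agent'
     */
    static Action decide(boolean agentConnected, boolean sshToAgent) {
        if (agentConnected) {
            // The live connection is kept, whatever the setting says.
            return Action.KEEP_CONNECTION;
        }
        // Agent is disconnected: the setting decides whether a restart over SSH is attempted.
        return sshToAgent ? Action.RESTART_AGENT_VIA_SSH : Action.FAIL_AGENT_NOT_CONNECTED;
    }

    public static void main(String[] args) {
        // Print the full decision table for all four combinations.
        for (boolean connected : new boolean[] {true, false}) {
            for (boolean ssh : new boolean[] {true, false}) {
                System.out.printf("connected=%b ssh=%b -> %s%n", connected, ssh, decide(connected, ssh));
            }
        }
    }
}
```

Checking the connection state first is what makes the connected case independent of 'kvm.ssh.to.agent'; the setting only matters once the agent is found to be disconnected.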

Summary

  • When an admin cancels maintenance mode on a KVM host:
    • If 'kvm.ssh.to.agent' = false and the agent is connected, then maintenance mode is cancelled.
    • If 'kvm.ssh.to.agent' = true and the agent is connected, then maintenance mode is cancelled.
    • If 'kvm.ssh.to.agent' = true and the agent is not connected, then the management server will attempt to SSH into the host and restart the agent. If the agent connects, then maintenance mode is cancelled. If the agent still does not connect, then maintenance mode fails to be cancelled and a suitable message is returned.
    • If 'kvm.ssh.to.agent' = false and the agent is not connected, then maintenance mode fails to be cancelled and a suitable message is returned.
  • A host must be able to exit maintenance under the following circumstances:
    • Host agent has remained online throughout the maintenance period
    • Host agent has been restarted after host went into maintenance
    • KVM host has been fully restarted during the maintenance period
  • A host in maintenance mode and with agent connected must be shown to remain connected to CloudStack management whilst rejecting all CloudStack operations (new internal agent state = hostInMaintenance)
  • Hosts in which SSH is allowed (kvm.ssh.to.agent = true) must still be able to exit maintenance mode regardless of whether the host or agent has been restarted during the maintenance period (i.e. no regressions to existing functionality).
  • Hosts in maintenance mode where the host itself, the host agent, or both have been restarted should reconnect to the management server, with the management server agent in the new internal hostInMaintenance state
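
The requirements above describe an agent that stays connected during maintenance but rejects operations, and that comes back in the same state after a restart. A minimal sketch of that behaviour follows, under stated assumptions: `MaintenanceAgentSketch`, its `State` enum and its method names are hypothetical and only model the description; they are not the classes touched by this PR.

```java
/** Hypothetical model of an agent that stays connected during maintenance; names are illustrative. */
final class MaintenanceAgentSketch {

    /** HOST_IN_MAINTENANCE stands in for the new internal hostInMaintenance state. */
    enum State { UP, HOST_IN_MAINTENANCE }

    private State state = State.UP;
    private boolean connected = true;

    /** Entering maintenance changes the state but leaves the connection open. */
    void enterMaintenance() { state = State.HOST_IN_MAINTENANCE; }

    /** Agent stopped, or host going down for a reboot. */
    void disconnect() { connected = false; }

    /** Agent or host came back: the connection is new, the maintenance state is preserved. */
    void reconnect() { connected = true; }

    /** Cancelling succeeds only while the agent is connected; returns whether it succeeded. */
    boolean cancelMaintenance() {
        if (!connected) {
            return false;
        }
        state = State.UP;
        return true;
    }

    /** Operations are accepted only by a connected agent that is not in maintenance. */
    boolean acceptsOperations() { return connected && state == State.UP; }

    boolean isConnected() { return connected; }

    State state() { return state; }

    public static void main(String[] args) {
        MaintenanceAgentSketch agent = new MaintenanceAgentSketch();
        agent.enterMaintenance();
        System.out.println("connected=" + agent.isConnected() + " accepts=" + agent.acceptsOperations());
        agent.disconnect();
        agent.reconnect();
        System.out.println("state after restart=" + agent.state());
        System.out.println("cancel succeeded=" + agent.cancelMaintenance());
    }
}
```

Keeping the connection flag separate from the maintenance state is the design point here: a restart only toggles the former, so a reconnecting agent is still in maintenance and still rejects operations until maintenance is cancelled.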

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Tested on an environment with 2 KVM hosts and NFS primary and secondary storage, changing the value of the global setting 'kvm.ssh.to.agent' for each test case.

Member

@yadvr yadvr left a comment

LGTM, but some changes requested.

@borisstoyanov
Contributor

I'll wait for @nvazquez to address the comments and will then trigger testing on this.

@nvazquez nvazquez requested review from GabrielBrascher and removed request for DaanHoogland March 27, 2019 11:14
@nvazquez
Contributor Author

Thanks @rhtyd @borisstoyanov, comments addressed and re-tested functionalities.

@nvazquez
Contributor Author

nvazquez commented Apr 4, 2019

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2680

@nvazquez nvazquez changed the title from "KVM: Fix agents dont reconnect post maintenance" to "[WIP DO NOT MERGE] KVM: Fix agents dont reconnect post maintenance" Apr 4, 2019
@nvazquez
Contributor Author

nvazquez commented Apr 4, 2019

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@borisstoyanov
Contributor

@blueorangutan test

@blueorangutan

@borisstoyanov a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3480)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 29394 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3239-t3480-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_privategw_acl.py
Intermittent failure detected: /marvin/tests/smoke/test_host_maintenance.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 66 look OK, 2 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_01_cancel_host_maintenace_with_no_migration_jobs | Failure | 1.15 | test_host_maintenance.py |
| test_02_cancel_host_maintenace_with_migration_jobs | Error | 3.53 | test_host_maintenance.py |
| test_hostha_enable_ha_when_host_disabled | Error | 2.62 | test_hostha_kvm.py |
| test_hostha_enable_ha_when_host_in_maintenance | Error | 304.79 | test_hostha_kvm.py |

@nvazquez
Contributor Author

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2719

@nvazquez
Contributor Author

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3529)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33967 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3239-t3529-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Smoke tests completed. 65 look OK, 3 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_04_extract_template | Failure | 128.43 | test_templates.py |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| test_06_download_detached_volume | Failure | 137.92 | test_volumes.py |

@yadvr
Member

yadvr commented May 1, 2019

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2725

@yadvr
Member

yadvr commented May 1, 2019

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3542)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 24119 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3239-t3542-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_outofbandmanagement.py
Smoke tests completed. 68 look OK, 0 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |

Contributor

@borisstoyanov borisstoyanov left a comment

LGTM, manual testing has passed with the automated tests as well

@DaanHoogland
Contributor

I see two approvals and a perfectly passed test suite. Is this still WIP?

@borisstoyanov
Contributor

ping @nvazquez

@nvazquez nvazquez changed the title from "[WIP DO NOT MERGE] KVM: Fix agents dont reconnect post maintenance" to "KVM: Fix agents dont reconnect post maintenance" May 23, 2019
@nvazquez
Contributor Author

Thanks @DaanHoogland @borisstoyanov, this feature is completed

@DaanHoogland DaanHoogland merged commit e86f671 into apache:4.11 May 23, 2019
}
try {
    SSHCmdHelper.SSHCmdResult result = SSHCmdHelper.sshExecuteCmdOneShot(
            connection, "service cloudstack-agent restart");
Member

This may be changed to systemctl restart cloudstack-agent || service cloudstack-agent restart
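
As a rough illustration of this suggestion (a sketch only, not what was merged): the two commands can be joined with the shell `||` operator, so the `service` form runs only if the `systemctl` form exits with a non-zero status. The class and constant names below are hypothetical; the resulting string would be passed as the command argument of the `sshExecuteCmdOneShot` call shown in the diff.

```java
/** Hypothetical helper that builds the restart command with a SysV fallback. */
final class AgentRestartCommand {

    static final String SYSTEMD_RESTART = "systemctl restart cloudstack-agent";
    static final String SYSV_RESTART = "service cloudstack-agent restart";

    /** The shell runs the right-hand command only if the left-hand one exits non-zero. */
    static String withFallback() {
        return SYSTEMD_RESTART + " || " + SYSV_RESTART;
    }

    public static void main(String[] args) {
        System.out.println(withFallback());
    }
}
```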

@DaanHoogland
Contributor

@rhtyd @nvazquez this one won't fwd-merge again due to moved files. I'll be looking at it later.

DaanHoogland added a commit that referenced this pull request May 23, 2019
* 4.11:
  KVM: Fix agents dont reconnect post maintenance (#3239)
DaanHoogland added a commit that referenced this pull request May 23, 2019
* 4.12:
  KVM: Fix agents dont reconnect post maintenance (#3239)
@DaanHoogland
Contributor

done. simpler than thought
